So far every pathway analysis method we’ve covered relies on some information about groups in our data. For over-representation analysis (ORA), we took the top 100 genes that distinguished neurons from other cell types in a scRNA-seq experiment – it relied on cell type labels. In the Gene Set Enrichment Analysis (GSEA) example, we used statistics from a differential gene expression (DGE) analysis where we compared MYCN amplified cell lines to non-amplified cell lines; we needed that amplification status information.
What if we’re less sure about groups in our data or we want to analyze our data in a more unsupervised manner?
In this notebook we will cover a method called Gene Set Variation Analysis (GSVA) (Hänzelmann, Castelo, and Guinney. 2013) that allows us to calculate gene set or pathway scores on a per-sample basis.
We’ll use the
# Pipes
library(magrittr)
# Gene Set Variation Analysis
library(GSVA)
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
# We have some medulloblastoma data from the OpenPBTA project that we've
# prepared ahead of time
input_dir <- file.path("data", "medulloblastoma")
# Create a directory specifically for our GSVA results if it does not yet exist
output_dir <- file.path("results", "gsva")
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
We have VST transformed RNA-seq data, annotated with gene symbols, that has been collapsed such that there are no duplicated gene identifiers (see setup).
rnaseq_file <- file.path(input_dir, "medulloblastoma_vst_collapsed.tsv")
gsva_results_file <- file.path(output_dir, "medulloblastoma_gsva_results.tsv")
The function that we will use to run GSVA wants the gene sets to be in a list, rather than a tidy data frame that we used with clusterProfiler (although it does accept other formats).
We’re going to take this opportunity to introduce a different format that gene sets are often distributed in called GMT (Gene Matrix Transposed).
We’re going to read in the Hallmark collection file directly from MSigDB, rather than using msigdbr like we did in earlier notebooks.
The RNA-seq data uses gene symbols, so we need gene sets that use gene symbols, too.
# R can often read in data from a URL
hallmarks_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.1/h.all.v7.1.symbols.gmt"
# QuSAGE is another pathway analysis method, the qusage package has a function
# for reading GMT files and turning them into a list
hallmarks_list <- qusage::read.gmt(hallmarks_url)
What does this list look like?
head(hallmarks_list)
We have VST transformed RNA-seq data, which is on a log2-like scale. These data are from the Open Pediatric Brain Tumor Atlas (OpenPBTA) OpenPBTA is a collaborative project organized by the CCDL and the Center for Data-Driven Discovery in Biomedicine (D3b) at the Children’s Hospital of Philadelphia conducted openly on GitHub.
You can read more about the project here.
We’re only working with the medulloblastoma samples in this example.
rnaseq_df <- readr::read_tsv(rnaseq_file)
Parsed with column specification:
cols(
.default = col_double(),
gene_symbol = [31mcol_character()[39m
)
See spec(...) for full column specifications.
# What does the RNA-seq data frame look like?
rnaseq_df[1:5, 1:5]
For GSVA, we need a matrix.
rnaseq_mat <- rnaseq_df %>%
tibble::column_to_rownames("gene_symbol") %>%
as.matrix()
Note: If we had duplicate gene symbols here, we couldn’t set them as rownames.
Figure 1 from Hänzelmann, Castelo, and Guinney. (2013).
You may notice that GSVA has some commonalities with GSEA. Rather than ranking genes based on some statistic, GSVA ranks genes based on their expression level relative to the sample distribution and calculates pathway-level scores. The inituition here is that we will get a pathway-level score that indicates if genes in a pathway that act concordantly in one direction (overexpressed or underexpressed relative to the overall population) (Hänzelmann, Castelo, and Guinney. (2013)).
The output is a gene set by sample matrix of GSVA scores.
gsva_results <- gsva(rnaseq_mat,
hallmarks_list,
method = "gsva",
# Appropriate for our transformed data on log2-like scale
kcdf = "Gaussian",
# Minimum gene set size
min.sz = 15,
# Maximum gene set size
max.sz = 500,
# Compute Gaussian-distributed scores
mx.diff = TRUE,
# Use 4 cores
parallel.sz = 4)
945 genes with constant expression values throuhgout the samples.Since argument method!="ssgsea", genes with constant expression values are discarded.
Estimating GSVA scores for 50 gene sets.
Computing observed enrichment scores
Estimating ECDFs with Gaussian kernels
Using parallel with 4 cores
|
| | 0%
Note: the gsva() documentation says we can use kcdf = "Gaussian" if we had RNA-seq log-CPMs, log-RPKMs or log-TPMs, but we would use kcdf = "Poisson" on integer counts.
# Let's explore what the output of gsva() looks like
gsva_results[1:5, 1:5]
BS_09Z7TC35 BS_1AYRM596 BS_1BWP5MCT BS_1QXEC43H
HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.44911439 0.5208667 -0.5609193 0.32300630
HALLMARK_HYPOXIA -0.38297104 0.2436910 -0.5058759 0.36247083
HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.26534735 0.1054224 -0.5180933 0.32418657
HALLMARK_MITOTIC_SPINDLE 0.12727006 0.2339489 -0.4338076 -0.07023068
HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.09287646 0.1602124 -0.4427594 0.41372523
BS_1TWCV047
HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.1468081
HALLMARK_HYPOXIA -0.2971559
HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.4561386
HALLMARK_MITOTIC_SPINDLE -0.1861483
HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.1152437
Often the scores of gene set enrichment methods are not comparable between gene sets of different sizes. Let’s do an experiment using randomly generated gene sets to explore this idea a bit more.
We need to get a collection of all possible genes we will sample from to create random gene sets. Because we’re doing some random sampling, we need to set a seed for this to be reproducible.
# Use all the gene symbols in the dataset as the pool of possible genes
all_genes <- rownames(rnaseq_mat)
# Set a seed for reproducibility
set.seed(2020)
Our minimum gene set size earlier was 15 genes and our maximum gene set size was 500 genes. We’ll use the same minimum and maximum values for our random gene sets and some values in between.
# Make a list of integers that indicate the random gene set sizes
gene_set_size <- list(15, 25, 50, 100, 250, 500)
For each gene set size, we will generate 100 random gene sets
# Set number of replicates
nreps <- 100
# Generate 100 random gene sets of each size
random_gene_sets <-
purrr::map(
rep(gene_set_size, nreps), # Repeat gene sizes so we run `nreps` times
# Sample the vector of all genes, choosing the number of items specified
# in the element of gene set size
~ base::sample(x = all_genes,
size = .x)
)
The Hallmarks list we used earlier stored the gene set names as the name of the list, so let’s add names to our random gene sets that indicate what size they are and so gsva() doesn’t get upset.
# We will include the size of the gene set in the gene set name
# Start by taking the length of each pathway and appending "pathway_" to that
# number
lengths_vector <- purrr::map(random_gene_sets, ~ length(.x)) %>%
purrr::map(~ paste0("pathway_", .x)) %>%
# Return a vector
purrr::flatten_chr()
# Add the names in lengths_vector to the list
random_gene_sets <- random_gene_sets %>%
# make.names() appends a "version" if something is not unique
purrr::set_names(nm = make.names(lengths_vector, unique = TRUE))
Run GSVA on our dataset with the same parameters as before, but now with random gene sets.
random_gsva_results <- gsva(rnaseq_mat,
random_gene_sets,
method = "gsva",
# Appropriate for our transformed data on
# log2-like scale
kcdf = "Gaussian",
# Minimum gene set size
min.sz = 15,
# Maximum gene set size
max.sz = 500,
# Compute Gaussian-distributed scores
mx.diff = TRUE,
# Use 4 cores
parallel.sz = 4)
945 genes with constant expression values throuhgout the samples.Since argument method!="ssgsea", genes with constant expression values are discarded.
Estimating GSVA scores for 579 gene sets.
Computing observed enrichment scores
Estimating ECDFs with Gaussian kernels
Using parallel with 4 cores
|
| | 0%
Now let’s make a plot to look at the distribution of scores from random gene sets. First we need to get this data in an appropriate format for ggplot2.
# The random results are a matrix
random_long_df <- random_gsva_results %>%
data.frame() %>%
# Gene set names are rownames
tibble::rownames_to_column("gene_set") %>%
# Get into long format
tidyr::pivot_longer(cols = -gene_set,
names_to = "Kids_First_Biospecimen_ID",
values_to = "gsva_score") %>%
# Remove the .version added by make.names()
dplyr::mutate(gene_set = stringr::str_remove(gene_set, "\\..*")) %>%
# Add a column that keeps track of the gene set size
dplyr::mutate(gene_set_size = stringr::word(gene_set, 2, sep = "_")) %>%
# We want to plot smallest no. genes -> largest no. genes
dplyr::mutate(gene_set_size = factor(gene_set_size,
levels = c(15, 25, 50, 100, 250, 500)))
Let’s make a violin plot so we can look at the distribution of scores by gene set size.
# Violin plot comparing GSVA scores of different random gene set sizes
random_long_df %>%
ggplot2::ggplot(ggplot2::aes(x = gene_set_size,
y = gsva_score)) +
# Make a violin plot that's a pretty blue!
ggplot2::geom_violin(fill = "#99CCFF", alpha = 0.5) +
# Add a point with the mean value
ggplot2::stat_summary(
geom = "point",
fun = "mean",
# Change the aesthetics of the points
size = 3,
color = "#0066CC",
shape = 18
) +
# Flip the axes
ggplot2::coord_flip() +
ggplot2::labs(title = "Random Gene Set GSVA Scores",
x = "gene set size",
y = "GSVA score") +
ggplot2::theme_bw()
What do you notice about these distributions? How might you use this information to inform your interpretation of GSVA scores?
If you did have groups information for your samples, you could test for pathway scores differences between groups. Here’s an example of that from the OpenPBTA project itself!
You can also visualize this matrix in a heatmap. Here’s a figure from the OpenPBTA project, where the middle panel is a heatmap of GSVA scores that were significantly different between histologies.
gsva_results %>%
as.data.frame() %>%
tibble::rownames_to_column("pathway") %>%
readr::write_tsv(path = gsva_results_file)
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GSVA_1.34.0 magrittr_1.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 mvtnorm_1.1-1 qusage_2.20.0
[4] lattice_0.20-38 tidyr_1.0.0 digest_0.6.25
[7] mime_0.9 R6_2.4.1 stats4_3.6.1
[10] RSQLite_2.2.0 ggplot2_3.3.0 pillar_1.4.4
[13] rlang_0.4.6 rstudioapi_0.11 annotate_1.64.0
[16] blob_1.2.0 S4Vectors_0.24.4 labeling_0.3
[19] shinythemes_1.1.2 readr_1.3.1 geneplotter_1.64.0
[22] stringr_1.4.0 munsell_0.5.0 RCurl_1.95-4.12
[25] bit_1.1-14 shiny_1.4.0.2 compiler_3.6.1
[28] httpuv_1.5.2 xfun_0.14 pkgconfig_2.0.3
[31] BiocGenerics_0.32.0 htmltools_0.4.0 tidyselect_1.1.0
[34] tibble_3.0.1 IRanges_2.20.2 XML_3.99-0.3
[37] crayon_1.3.4 dplyr_1.0.0 later_1.0.0
[40] bitops_1.0-6 grid_3.6.1 gtable_0.3.0
[43] nlme_3.1-140 xtable_1.8-4 GSEABase_1.48.0
[46] lifecycle_0.2.0 DBI_1.1.0 scales_1.1.0
[49] graph_1.64.0 estimability_1.3 stringi_1.4.6
[52] farver_2.0.3 promises_1.1.0 limma_3.42.2
[55] ellipsis_0.3.1 vctrs_0.3.1 generics_0.0.2
[58] RColorBrewer_1.1-2 tools_3.6.1 bit64_0.9-7
[61] Biobase_2.46.0 glue_1.4.1 purrr_0.3.4
[64] hms_0.5.3 emmeans_1.4.8 parallel_3.6.1
[67] fastmap_1.0.1 colorspace_1.4-1 AnnotationDbi_1.48.0
[70] memoise_1.1.0 knitr_1.28